Before you write a content policy, learn to read one — how a frontier lab turns "be safe" into a document
Day 6 of 60
Last week you learned to threat-model: "what could go wrong, for whom?" This week you answer the question that immediately follows — "and what exactly do we do about it?" The answer is never a vibe. It's a document: a safety taxonomy that names the categories of harm, a severity scheme that ranks them, and a policy that says what the model should refuse, allow, or safely complete. Every filter, every eval, every human reviewer downstream is measured against that document. Authoring it well is one of the highest-leverage things a safety practitioner does.
"Is this output harmful?" is meaningless until someone defines the categories, the tiers, and the edge rules. The policy is the contract. A label is only as good as the policy it inherited; a refusal is only as defensible as the rule behind it. Before you write one, you read the best ones in the world.
So today is deliberately a reading day. You're going to study two real, published policies from frontier labs and reverse-engineer their structure — because the fastest way to write a good taxonomy is to see exactly how the people who do this for a living shaped theirs.
The clearest example of policy-as-document is the OpenAI Model Spec. It's not a vague mission statement — it's a behavioral specification with a chain of command (platform rules > developer rules > user requests > defaults), explicit defaults, and worked examples of how the model should resolve conflicts. Crucially, it states how the model should trade helpfulness against safety, rather than pretending the two never collide.
A real policy answers "what if the user asks for something the platform forbids?" up front. The Model Spec's chain of command makes the precedence explicit, so the model isn't improvising priority under pressure.
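To make the precedence idea concrete, here is a minimal sketch of conflict resolution by source authority. It is a toy illustration of the ordering described above, not how the Model Spec is implemented; the names `Authority`, `Instruction`, `resolve`, and the `restricted_topic` key are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Authority(IntEnum):
    # Lower value = higher precedence, mirroring the ordering in the text.
    PLATFORM = 0
    DEVELOPER = 1
    USER = 2
    DEFAULT = 3


@dataclass(frozen=True)
class Instruction:
    source: Authority
    topic: str      # hypothetical key naming what the instruction governs
    directive: str  # hypothetical value such as "forbid" or "allow"


def resolve(instructions: list[Instruction], topic: str) -> Optional[Instruction]:
    """Return the instruction on `topic` from the highest-precedence source."""
    relevant = [i for i in instructions if i.topic == topic]
    # min() over the IntEnum picks the most authoritative source; None if no match.
    return min(relevant, key=lambda i: i.source, default=None)


instructions = [
    Instruction(Authority.USER, "restricted_topic", "allow"),
    Instruction(Authority.PLATFORM, "restricted_topic", "forbid"),
    Instruction(Authority.DEFAULT, "tone", "friendly"),
]

winner = resolve(instructions, "restricted_topic")
print(winner.source.name, winner.directive)  # PLATFORM forbid
```

The point of the sketch is that the ordering is written down before any conflict occurs. A real spec adds nuance this toy omits, such as which defaults a lower level is allowed to adjust.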
Most interactions aren't edge cases. A policy specifies the default posture (assume good intent, be helpful, ask for clarification when a request is ambiguous) so the common path is defined, not accidental.
The hardest part of any policy is the borderline. A mature spec says explicitly what to refuse, what to allow, and what to safe-complete (answer the safe portion, or answer with caveats), because a policy that refuses everything borderline is as broken as one that allows real harm.
As you browse the Model Spec, keep asking: why did they phrase it this way? Notice where a rule is written to be actionable (a reviewer could apply it consistently) versus aspirational. That distinction is the whole craft, and it's exactly what you'll imitate when you build your own taxonomy later this week.
Where the Model Spec governs behavior, an acceptable-use policy governs permitted use — and it's where you'll see the category structure your own taxonomy will mirror. Read the Anthropic Usage Policy and pay attention to how each prohibited category is defined to be actionable: not "don't be harmful," but specific, decidable categories a reviewer can apply without re-litigating intent every time.
That actionability is the difference between a policy that scales to a team and one that lives only in the author's head. A good category definition is one where two trained reviewers, given the same item, reach the same verdict. That property — inter-rater agreement — is what you're really designing for.
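One common way to put a number on inter-rater agreement is Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. The sketch below is a plain-Python version for two raters; the function name `cohens_kappa` and the labels `"violates"` / `"ok"` are illustrative assumptions, not part of either published policy.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two trained reviewers on the same four items.
reviewer_1 = ["violates", "violates", "ok", "ok"]
reviewer_2 = ["violates", "ok", "ok", "ok"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(kappa)  # 0.5
```

Here the reviewers agree on three of four items (0.75 raw agreement), but half of that is expected by chance, so kappa is 0.5. A low kappa on a category is a signal that its definition is not yet decidable.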
By the end of today you should be able to list the five parts of a policy from memory: (1) named categories of harm, (2) a precise definition per category, (3) severity tiers, (4) worked examples (and ideally benign look-alikes), and (5) a routing/refusal rule for what to do when each one fires. You'll build all five this week.
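As a rough sketch of how those five parts might sit together in one record, the structure below maps each numbered part to a field. Everything here is a hypothetical illustration: the `Category` and `Action` names, the three-tier severity scale, and the placeholder strings are assumptions, not content from either published policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    SAFE_COMPLETE = "safe_complete"
    REFUSE = "refuse"


@dataclass
class Category:
    name: str                                  # (1) named category of harm
    definition: str                            # (2) precise definition
    routing: dict[int, Action]                 # (3) severity tiers -> (5) routing rule
    violating_examples: tuple[str, ...] = ()   # (4) worked examples
    benign_lookalikes: tuple[str, ...] = ()    # (4) benign look-alikes

    def route(self, severity: int) -> Action:
        """Return the action for a severity tier, failing loudly on unknown tiers."""
        if severity not in self.routing:
            raise ValueError(f"no routing rule for severity {severity} in {self.name}")
        return self.routing[severity]


# Placeholder category: the strings stand in for text a policy author would write.
example = Category(
    name="example_harm",
    definition="<one decidable sentence a reviewer can apply>",
    routing={1: Action.ALLOW, 2: Action.SAFE_COMPLETE, 3: Action.REFUSE},
    violating_examples=("<clear violation>",),
    benign_lookalikes=("<similar-looking but acceptable request>",),
)

print(example.route(2).value)  # safe_complete
```

Raising on an unmapped tier is a deliberate choice in this sketch: a severity with no routing rule is a gap in the policy, and it is better surfaced than silently defaulted.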
A practitioner reads a safety policy to learn the rules. An expert reads it to learn the design decisions: why is this category split from that one, why is the threshold here, why is this phrased to be decidable rather than merely correct? The altitude jump is from following a policy to being able to author and defend one — and the fastest path there is reverse-engineering the best published examples.
Say this in an interview: "I treat a content policy as the contract every downstream filter and reviewer inherits, so I study real specs (the OpenAI Model Spec's chain of command, a lab acceptable-use policy's category structure) for their design decisions, not just their rules. The test I hold a category to is whether two trained reviewers would agree on it."